Assessing Read Quality

Time
  • Teaching: 20 min
  • Exercises: 15 min
Questions
  • How can I describe the quality of my data?
Objectives
  • Explain how a FASTQ file encodes per-base quality scores
  • Interpret a FastQC plot summarizing per-base quality across all reads

Bioinformatic workflows

When working with high-throughput sequencing data, the raw reads you get off the sequencer must pass through several different tools to generate your final desired output. The execution of this set of tools in a specified order is commonly referred to as a workflow or a pipeline.

An example of the workflow we will be using for our analysis is provided below, with a brief description of each step.

Metagenomics Workflow
  1. Quality control - Assessing quality using FastQC and Trimming and/or filtering reads (if necessary)
  2. Assembly of metagenome
  3. Binning
  4. Taxonomic assignation

These workflows in bioinformatics adopt a plug-and-play approach in that the output of one tool can be easily used as input to another tool without any extensive configuration. Having standards for data formats is what makes this feasible. Standards ensure that data is stored in a way that is generally accepted and agreed upon within the community. Therefore, the tools used to analyze data at different workflow stages are built, assuming that the data will be provided in a specific format.

Quality Control

We will now assess the quality of the sequence reads contained in our FASTQ files.

Sequence Reads QC

Details on the FASTQ format

Although it looks complicated (and it is), we can understand the FASTQ format with a little decoding. Some rules about the format include the following:

Line Description
1 Always begins with ‘@’ followed by the information about the read
2 The actual DNA sequence
3 Always begins with a ‘+’ and sometimes contains the same info as in line 1
4 Has a string of characters which represent the quality scores; must have same number of characters as line 2

Below is an example of what a single read looks like in a FASTQ file.

1   @MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:12622:2006 1:N:0:CTCAGA
2   CCCGTTCCTCGGGCGTGCAGTCGGGCTTGCGGTCTGCCATGTCGTGTTCGGCGTCGGTGGTGCCGATCAGGGTGAAATCCGTCTCGTAGGGGATCGCGAAGATGATCCGCCCGTCCGTGCCCTGAAAGAAATAGCACTTGTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCTCAGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAGCAAACCTCTCACTCCCTCTACTCTACTCCCTT                                        
3   +                                                                                                
4   A>>1AFC>DD111A0E0001BGEC0AEGCCGEGGFHGHHGHGHHGGHHHGGGGGGGGGGGGGHHGEGGGHHHHGHHGHHHGGHHHHGGGGGGGGGGGGGGGGHHHHHHHGGGGGGGGHGGHHHHHHHHGFHHFFGHHHHHGGGGGGGGGGGGGGGGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFBFFFF@F@FFFFFFFFFFBBFF?@;@#################################### 

Line 4 shows the quality of each nucleotide in the read. Quality is interpreted as the probability of an incorrect base call (e.g., 1 in 10) or, equivalently, the base call accuracy (e.g., 90%). Each nucleotide’s numerical score value is converted into a character code where every single character represents a quality score for an individual nucleotide. This conversion allows the alignment of each individual nucleotide with its quality score. For example, in the line above, the quality score line is:

A>>1AFC>DD111A0E0001BGEC0AEGCCGEGGFHGHHGHGHHGGHHHGGGGGGGGGGGGGHHGEGGGHHHHGHHGHHHGGHHHHGGGGGGGGGGGGGGGGHHHHHHHGGGGGGGGHGGHHHHHHHHGFHHFFGHHHHHGGGGGGGGGGGGGGGGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFBFFFF@F@FFFFFFFFFFBBFF?@;@#################################### 

The numerical value assigned to each character depends on the sequencing platform that generated the reads. The sequencing machine used to generate our data uses the standard Sanger quality PHRED score encoding, using Illumina version 1.8 onwards. Each character is assigned a quality score between 0 and 41, as shown in the chart below.

Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
                   |         |         |         |         |
Quality score:    01........11........21........31........41                                

Each quality score represents the probability that the corresponding nucleotide call is incorrect. These probability values are the results of the base calling algorithm and depend on how much signal was captured for the base incorporation. This quality score is logarithmically based, so a quality score of 10 reflects a base call accuracy of 90%, but a quality score of 20 reflects a base call accuracy of 99%. In this link you can find more information about quality scores.

Looking back at our read:

@MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:12622:2006 1:N:0:CTCAGA
CCCGTTCCTCGGGCGTGCAGTCGGGCTTGCGGTCTGCCATGTCGTGTTCGGCGTCGGTGGTGCCGATCAGGGTGAAATCCGTCTCGTAGGGGATCGCGAAGATGATCCGCCCGTCCGTGCCCTGAAAGAAATAGCACTTGTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCTCAGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAGCAAACCTCTCACTCCCTCTACTCTACTCCCTT                                        
+                                                                                                
A>>1AFC>DD111A0E0001BGEC0AEGCCGEGGFHGHHGHGHHGGHHHGGGGGGGGGGGGGHHGEGGGHHHHGHHGHHHGGHHHHGGGGGGGGGGGGGGGGHHHHHHHGGGGGGGGHGGHHHHHHHHGFHHFFGHHHHHGGGGGGGGGGGGGGGGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFBFFFF@F@FFFFFFFFFFBBFF?@;@#################################### 

We can now see that there is a range of quality scores but that the end of the sequence is very poor (# = a quality score of 2).

In real life, you won’t be assessing the quality of your reads by visually inspecting your FASTQ files. Instead, you’ll use a software program to assess read quality and filter out poor reads. We’ll first use a program called FastQC to visualize the quality of our reads. Later in our workflow, we’ll use another program to filter out poor-quality reads.

Working with Galaxy

Throughout this workshop we are going to be using a web-based analysis platform called Galaxy. Galaxy is a community-driven web-based analysis platform for life science research. Galaxy was built to allow any research to carry out data analysis, using tools that normally require command-line knowledge. As such, no programming knowledge is required to access Galaxy and its tools with a web browser.

The Galaxy EU platform, which we will utilise in this workshop, has numerous subdomains that are dedicated to a particular area of life science research, such as metagenomics. Therefore, all the work carried out here will be within the Metagenomics Galaxy EU subdomain.

Obtain the Data

At the start of this session you were given SRA accession numbers. These are the reads you will be processing today.

First thing we need to do is load our reads into Galaxy. We will be doing this using a shared Galaxy History, however in reality for this dataset, you would use a built-in tool called Download and Extract Reads in FASTQ format from NCBI SRA, therefore for your information, the steps are outlined below.

To do this, use the Tools panel and open up the Get Data section. Within this section you will find the tool Download and Extract Reads in FASTQ format from NCBI SRA, as highlighted in the figure below.

Obtaining Reads from NCBI SRA

Enter the SRA accession number in the Accession box. Default parameters are fine, therefore we select Run Tool

Once the reads are loaded, the History panel will list them with a green background. Items in the history panel with a yellow background are being processed, grey background means the job is queued and red means the job has failed.

If we click on the Show Hidden icon, then our paired-end reads will show as forward and reverse in the history panel. We can then click the hidden icon for each set of reads, to have them appear for subsequent analyses. These are the reads we will be working with.

Importing a shared Galaxy History is simpler. To obtain the data use the following link:

https://metagenomics.usegalaxy.eu/u/alisonmacfadyentsl/h/tsl-summer-conference

This will open up Metagenomics Galaxy and allow you to import the history.

Import Shared History

The Galaxy page that loads will show the history that has been shared. Click the button at the top that says Import this History.

A pop-up will then appear, giving you the option to rename the history and to chose whether you want to copy all datasets, including deleted one. Select Copy only the active, non-deleted datasets and select Copy History.

Copy History Pop-up

You can then go to the Galaxy Homepage/refresh your page and you will see that your History has been populated with the data.

To simplify your history, you can delete the datasets you do not need e.g. delete all datasets except the SRA numbers you have been given. This can be achieved by clicking the “Trashcan” icon.

Populated History and Button for Deletion

Assessing quality using FastQC

FastQC has several features that can give you a quick impression of any problems your data may have, so you can consider these issues before moving forward with your analyses. Rather than looking at quality scores for each read, FastQC looks at quality collectively across all reads within a sample. The image below shows one FastQC-generated plot that indicates a very high-quality sample:

High Quality Sample

The x-axis displays the base position in the read, and the y-axis shows quality scores. In this example, the sample contains reads that are 40 bp long. This length is much shorter than the reads we are working on within our workflow. For each position, there is a box-and-whisker plot showing the distribution of quality scores for all reads at that position. The horizontal red line indicates the median quality score, and the yellow box shows the 1st to 3rd quartile range. This range means that 50% of reads have a quality score that falls within the range of the yellow box at that position. The whiskers show the whole range covering the lowest (0th quartile) to highest (4th quartile) values.

The quality values for each position in this sample do not drop much lower than 32, which is a high-quality score. The plot background is also color-coded to identify good (green), acceptable (yellow) and bad (red) quality scores.

Now let’s look at a quality plot on the other end of the spectrum.

Low Quality Sample

The FastQC tool produces several other diagnostic plots to assess sample quality and the one plotted above. Here, we see positions within the read in which the boxes span a much more comprehensive range. Also, quality scores drop pretty low into the “bad” range, particularly on the tail end of the reads.

Running FastQC

We will now assess the quality of the reads that we downloaded.

Running FastQC with Multiple Datasets

In the Tools panel use the search box to find FastQC. Select the Multiple Datasets option and click on both the forward and reverse reads. We will use default parameters. Select Run Tools.

The job will then queue and appear grey, then change to yellow as it runs, changing to green upon completion:

FastQC queued and then running

Viewing the FastQC Results

Now we can select the Webpage output from FastQC to view the results. To view the results, select the History output that says FastQC on data X: Webpage and click the “eye” icon. This will open the FastQC results in an .html format.

Exercise 1: Discuss the quality of sequencing files

Discuss your results with a neighbour. Which sample(s) look the best per base sequence quality? Which sample(s) look the worst?

All of the reads contain usable data, but the quality decreases toward the end of the reads. Note our Metashotgun samples are a subset of the complete dataset. This has resulted in skewed results.

Decoding the other FastQC outputs

We’ve now looked at quite a few “Per base sequence quality” FastQC graphs, but there are nine other graphs that we haven’t talked about! Below we have provided a brief overview of interpretations for each plot. For more information, please see the FastQC documentation here

  • Per tile sequence quality: the machines that perform sequencing are divided into tiles. This plot displays patterns in base quality along these tiles. Consistently low scores are often found around the edges, but hot spots could also occur in the middle if an air bubble was introduced during the run.
  • Per sequence quality scores: a density plot of quality for all reads at all positions. This plot shows what quality scores are most common.
  • Per base sequence content: plots the proportion of each base position over all of the reads. Typically, we expect to see each base roughly 25% of the time at each position, but this often fails at the beginning or end of the read due to quality or adapter content.
  • Per sequence GC content: a density plot of average GC content in each of the reads.
  • Per base N content: the percent of times that ‘N’ occurs at a position in all reads. If there is an increase at a particular position, this might indicate that something went wrong during sequencing.
  • Sequence Length Distribution: the distribution of sequence lengths of all reads in the file. If the data is raw, there is often a sharp peak; however, if the reads have been trimmed, there may be a distribution of shorter lengths.
  • Sequence Duplication Levels: a distribution of duplicated sequences. In sequencing, we expect most reads to only occur once. If some sequences are occurring more than once, it might indicate enrichment bias (e.g. from PCR). This might not be true if the samples are high coverage (or RNA-seq or amplicon).
  • Overrepresented sequences: a list of sequences that occur more frequently than would be expected by chance.
  • Adapter Content: a graph indicating where adapter sequences occur in the reads.
  • K-mer Content: a graph showing any sequences which may show a positional bias within the reads.
Exercise 2: Quality Tests

Did your samples fail any of the FastQC’s quality tests? What test(s) did the samples fail?

Our metabarcoding samples failed:

  • Per base sequence content

  • Per sequence GC content

  • Sequence duplication levels

  • Overrepresented sequences

Quality Encodings Vary

Although we’ve used a particular quality encoding system to demonstrate the interpretation of read quality, different sequencing machines use different encoding systems. This means that depending on which sequencer you use to generate your data, a # may not indicate a poor quality base call.

This mainly relates to older Solexa/Illumina data. However, it’s essential that you know which sequencing platform was used to generate your data to tell your quality control program which encoding to use. If you choose the wrong encoding, you run the risk of throwing away good reads or (even worse) not throwing away bad reads!

Keypoints
  • It is important to know the quality of our data to make decisions in the subsequent steps
  • FastQC is a program that allows us to know the quality of FASTQ files